Abstract. —This review provides descriptions of the molecular biological techniques of PCR and
DNA sequencing and the analysis of DNA sequence variability for discerning phylogenetic
relationships. There is a brief discussion of three general categories of phylogenetic analysis, including
those based on pairwise distances, parsimony, and maximum likelihood. Lastly, techniques for
assessing the robustness of phylogenetic hypotheses and for finding multiple islands of equally
parsimonious trees are reviewed.


The field of molecular systematics has become dominated by two techniques
used for the exploration of DNA sequence variability: restriction site mapping
and DNA sequencing. The most commonly used phylogenetic methods
employed for analyzing molecular data sets include those based on pairwise
distances between taxa, parsimony, and maximum likelihood (see Swofford and
Olsen, 1990, for a thorough description of a number of analytical techniques).
In practice, parsimony has become the favored technique, although several
recent studies have shown that other methods may be just as effective at
finding the true phylogeny of a group under a variety of conditions (e.g.,
Huelsenbeck and Hillis, 1993). The explanations and discussions presented herein are
primarily intended for those not familiar with the generation of molecular data
and their phylogenetic analysis.

Because the primary type of data employed in the molecular-based studies 
presented in the current symposium is that of DNA sequences, the focus of 
the present contribution will be to describe the DNA sequencing technique 
and to discuss several analytical problems often encountered with sequence 
data (or any large data set) and some possible solutions. Brief descriptions of 
the phylogenetic analyses based on parsimony, maximum likelihood, and pairwise
genetic distances will be provided. For a recent detailed comparison of
restriction site mapping and sequence data see Olmstead and Palmer (1994). 
Several recent reviews provide broad overviews of the results of molecular 
phylogenetic studies of plants (e.g., Palmer et al., 1988; Soltis et al., 1992; 
Clegg, 1993; Doyle, 1993; Sytsma and Hahn, 1994; Soltis and Soltis, 1995). 

Obtaining Template DNA for Sequencing: The Polymerase Chain Reaction

The most commonly used technique for isolating a particular region of DNA
for sequencing from an extract of total cellular DNA is the polymerase chain
reaction (PCR; Mullis, 1990; Saiki et al., 1985). PCR is a method of
preferentially synthesizing millions of copies of a particular region of interest such that
sufficient quantities are available for sequencing reactions; this procedure is
referred to as PCR amplification. To understand PCR and the technique of DNA
sequencing, one must have a basic knowledge of the structure of DNA and the
process of DNA replication. The building blocks of DNA molecules are
2'-deoxyribonucleotides, each of which consists of a nitrogenous base (i.e.,
adenine, cytosine, guanine, or thymine), the sugar 2'-deoxyribose, and a
phosphate group. DNA molecules comprise two strands of nucleotides that are held
together by hydrogen bonds. The ends of a single strand of DNA are called the
“5' end” (read “5-prime-end”) and the “3' end”, which refer to the location
of the chemical bonds between adjacent nucleotides with respect to the
numbered carbons in the deoxyribose molecules. Specifically, adjacent nucleotides
in DNA molecules are joined via a phosphate bond between the 3' carbon of
the “upstream” nucleotide (relative to the direction of synthesis; see below)
and the 5' carbon of the “downstream” nucleotide (Fig. 1). The terminal 3'
end of a DNA strand will have a free hydroxyl group (OH) bonded to the 3'
carbon.

For successful synthesis of a new single strand of DNA in vitro, an existing 
DNA double-stranded molecule must be present to serve as a template. The 
double-stranded template molecule must first be heated to 94-96°C to denature 
the DNA (i.e., to break the hydrogen bonds holding the strands together) and 
produce single strands. The single strands must be “primed” before synthesis 
can begin; this means that a short segment of complementary (i.e., A pairs 
with T, G with C) single-stranded DNA (typically 15 to 30 nucleotides) must 
be annealed to the template. A new strand will be synthesized in the presence 
of a polymerase enzyme by sequential incorporation of nucleotides onto the 
available 3' end of the growing molecule. Thus, the molecule being
synthesized grows from its 5' end towards its 3' end, which runs in the opposite
direction of the existing template strand. It is important to note that the
addition of a new nucleotide through the formation of a phosphate bond will
only happen if the open 3' end of the molecule has a free hydroxyl group (Fig.
1). Also note that the actual molecules used in the reaction are
deoxynucleoside triphosphates (dNTPs), from which two phosphate groups (a
pyrophosphate) are cleaved during the formation of the phosphate bond that joins
adjacent nucleotides. All four dNTPs (dATP, dCTP, dGTP, and dTTP) must be present
for synthesis to proceed.

Fig. 1. Depiction of a new nucleotide being incorporated onto the 3' end of a growing DNA
molecule. PPi = pyrophosphate. Note that if the 3'-hydroxyl group were missing, as in the case
of a ddNTP (see text), the polymerization process would not proceed. (Modified from the DNA
Sequencing Guide, United States Biochemical Corporation.)

Fig. 2. Depiction of PCR. A) Denaturation step at 95°C; B) primer annealing at 37-50°C; C) primer
extension (= synthesis) at 72°C; D) second round of denaturation and primer annealing; E) second
round of primer extension; F) denaturation after second round of PCR resulting in first products
of target region (indicated by arrows). (Modified from information supplied by Cetus Corporation.)

PCR relies on the general process of DNA synthesis conducted in vitro;
however, the critical difference is that two primers are needed for PCR that delimit
a region of interest. The primers used for PCR amplification possess several
characteristics: 1) they are typically from 15 to 30 nucleotides long; 2) they
are designed such that they flank the gene or region of interest (Fig. 2); and,
3) they are complementary to opposite strands of the double-stranded DNA
molecule. Initial PCR reactions usually contain extracts of total cellular DNA
from a single organism, a highly thermal-stable DNA polymerase, magnesium
chloride (required as a co-factor by the polymerase), a buffer, all four dNTPs,
and two primers. A reaction mixture is initially heated to 94-96°C to destroy
any heat-sensitive contaminants (i.e., proteins) and to denature the sample
DNA (Fig. 2A). Amplification is dependent upon the repetition of a thermal
cycle consisting of three steps, each of which is characterized by a particular
temperature and duration and is designed to allow a different stage of DNA
synthesis to proceed. After the initial denaturation step, the temperature is
lowered to a point that typically favors the reassociation of complementary
single strands of DNA; this is usually between 37°C and 55°C (Fig. 2B). In a
reaction mixture, some of the original complementary strands of the sample
DNA will reanneal, but sometimes the primers in the reaction will anneal to
regions of complementarity on the sample DNA (if such complementarity
exists) before reassociation of original strands can occur. Next, the temperature
is increased to a point suitable for DNA synthesis (or “extension”) by the
polymerase, typically 72°C (Fig. 2C). During this part of the cycle, new
nucleotides, which are complementary to the sample DNA serving as a template,
are added onto the 3' end of the annealed primer, and the molecule continues
to grow in the 5' to 3' direction (Figs. 1 and 2C). This three-part (denaturation,
primer annealing, chain extension) cycle is repeated 30 to 40 times. Note that
after the second cycle (Fig. 2F), some of the newly produced strands are only
of the length delimited by the two primers, and that, because the amount of
this sequence produced will double after each thermal cycle, a theoretical
amplification factor of about one billion is attained after 30 cycles.
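
The way two primers delimit a target region, and the doubling arithmetic behind
the "one billion" figure, can be sketched in a few lines of Python. This is only
an illustrative sketch: the function names, the short template, and the two
primers are hypothetical, exact matching stands in for annealing, and perfect
doubling in every cycle is assumed (real reactions fall short of this).

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def amplicon(template: str, forward: str, reverse: str) -> str:
    """Return the region delimited by two primers, or "" if a primer fails to match.

    `template` is one strand written 5'->3'. `forward` matches that strand
    directly; `reverse` is written 5'->3' and is complementary to it, so the
    reverse complement of `reverse` is what appears in `template`.
    """
    start = template.find(forward)
    if start == -1:
        return ""
    rc = reverse_complement(reverse)
    end = template.find(rc, start + len(forward))
    if end == -1:
        return ""
    return template[start:end + len(rc)]


def theoretical_copies(cycles: int, starting_copies: int = 1) -> int:
    """Idealized yield, assuming the target doubles in every thermal cycle."""
    return starting_copies * 2 ** cycles


# Hypothetical template and primers (far shorter than the 15-30 nt used in practice).
template = "TTGACCATGGCTAGCTAGCGATTACA"
print(amplicon(template, "ACCATGG", "GTAATCG"))  # ACCATGGCTAGCTAGCGATTAC
print(theoretical_copies(30))                    # 1073741824, about one billion
```

The two primers sit on opposite strands, which is why the second primer is
reverse-complemented before it is located on the template strand.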

DNA Sequencing 

The most commonly used technique to obtain DNA sequences is the
chain-termination technique described by Sanger et al. (1977), often referred to as
“Sanger Dideoxy Sequencing”. Chain-termination DNA sequencing takes
advantage of the fact that if a free hydroxyl group is absent from the terminal 3'
carbon on a DNA molecule, a new nucleotide cannot be incorporated and the 
synthesis of the new molecule is terminated (see Fig. 1). Nucleotides that lack 
hydroxyl groups at both the 2' and 3' positions of the sugar molecule are 
referred to as 2',3'-dideoxynucleotides (ddNTPs). Thus, if we were to set up a 
reaction that contained DNA polymerase, many copies of a template DNA, 
many copies of a particular primer, sufficient quantities of all four dNTPs, and 
many copies of one particular ddNTP (e.g., ddATP) and if we assume that the 
incorporation of a ddNTP and its dNTP analog (i.e., ddATP and dATP) were 
equally likely events, we would expect a population of new DNA molecules 
to be produced with the following characteristics: All new molecules would 
have the same 5' end (i.e., the 5' end of the primer) but they would be of 
varying lengths and each would have a 3' end with a ddATP molecule (Fig. 
3). If, at the same time, we set up three additional reactions with the same 
template DNA, primer, and all four dNTPs, but each with a different ddNTP, 
when we examined the products across all four reactions, we would expect to 
find molecules of all possible lengths beginning with the nucleotide at the 5' 
end of the primer and ending at the 3' end with the ddNTP that was supplied 
in the particular reaction tube (Fig. 3). The products of each of these four 
“termination” reactions could be separated electrophoretically based on size 
under the influence of an electric current in a polyacrylamide gel, with the 
smallest molecules running the fastest and appearing at the bottom of the gel 
and with successively larger molecules appearing higher on the gel (Fig. 4). 
Because a radioactively-labeled dATP is typically included in each termination 
reaction, the products of these reactions are most commonly visualized by 
autoradiography. Autoradiography is a process wherein the gel containing the 
DNA fragments is placed against a sheet of x-ray film in a light-proof cassette 
for 12 hours to several days. The radioactively-labeled fragments expose the 
film leaving dark bands (Fig. 4). As shown in Fig. 4, we can obtain the DNA 
sequence by reading up the “autorad”, reading successively larger fragments 
within and across lanes in the 5' to 3' direction; the sequence read is the 
complement of the template DNA used in the reactions. 
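
The logic of the four termination reactions and of reading the gel can be
mimicked with a short Python sketch. It assumes the idealized case described
above, in which every possible terminated product is present; the function
names are hypothetical, and the example strand is the eight-nucleotide product
shown in Fig. 3.

```python
from __future__ import annotations


def termination_lengths(new_strand: str, dd_base: str, primer_len: int = 0) -> list[int]:
    """Lengths of all products that could end in the given dideoxynucleotide.

    `new_strand` is the sequence synthesized beyond the primer, written 5'->3'.
    """
    return [primer_len + i + 1 for i, base in enumerate(new_strand) if base == dd_base]


def run_four_lanes(new_strand: str, primer_len: int = 0) -> dict[str, list[int]]:
    """One termination reaction (one gel lane) per ddNTP."""
    return {base: termination_lengths(new_strand, base, primer_len) for base in "ACGT"}


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read bands from the bottom (shortest) to the top (longest) across lanes."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


lanes = run_four_lanes("GCTAGCTA")
print(lanes["A"])       # [4, 8]: the two products that end in ddA
print(read_gel(lanes))  # GCTAGCTA, the complement of the template
```

Because each fragment length occurs in exactly one lane, sorting the bands by
length recovers the synthesized strand in the 5' to 3' direction.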

Reconstructing Phylogenetic History 

For detailed discussions of molecular data and methods of phylogenetic
reconstruction, see Felsenstein (1988), Swofford and Olsen (1990), and
Miyamoto and Cracraft (1991). Only a cursory discussion will be presented here.
[Figure panels: primed template DNA with a free 3'-OH on the primer; products after reaction with dNTPs + ddATP.]

Fig. 3. Diagrammatic representation of chain termination sequencing with ddATP. All possible
products ending in a ddATP would be expected in the reaction mixture. (Modified from the DNA
Sequencing Guide, United States Biochemical Corporation.)


Most recent studies that have attempted to reconstruct the phylogenetic
history of a group of plants have employed the principle of maximum parsimony,
which states that simpler hypotheses are favored over more complex ones (e.g.,
Sober, 1983, 1989). Parsimony is realized in phylogenetic reconstruction by
favoring trees of minimal length over longer trees. Tree length is simply the
number of changes (or “steps”) from one character state to another that must
be hypothesized on any tree that depicts relationships among taxa to fit the
data (Swofford and Olsen, 1990).
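
As an illustration of how tree length can be counted for a fixed tree, the
sketch below uses Fitch's algorithm for unordered, equally weighted characters.
That choice is an assumption made here for the example; the text above does not
prescribe a particular algorithm. The four taxa, their three-site sequences,
and the function names are all hypothetical.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, steps) for one character.

    `tree` is a taxon name (leaf) or a pair of subtrees; `states` maps each
    taxon name to its observed character state.
    """
    if isinstance(tree, str):              # leaf: observed state, no changes yet
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_site(left, states)
    right_set, right_steps = fitch_site(right, states)
    shared = left_set & right_set
    if shared:                             # descendants can agree: no extra step
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Sum the minimum number of steps over all aligned sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


alignment = {"A": "AAG", "B": "AAG", "C": "CTG", "D": "CTA"}
print(tree_length((("A", "B"), ("C", "D")), alignment))  # 3
print(tree_length((("A", "C"), ("B", "D")), alignment))  # 5
```

Under the parsimony criterion the first arrangement, which needs three steps,
would be preferred over the second, which needs five.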

Fig. 4. Diagrammatic representation of an autoradiographic image produced by exposure to a
DNA sequencing gel. (Modified from the DNA Sequencing Guide, United States Biochemical
Corporation.)

The maximum likelihood method of phylogenetic reconstruction attempts
to find the single tree which “. . . yields the highest probability of evolving the
observed data . . .” (Felsenstein, 1981) under a given model of evolution. For
example, the Jukes-Cantor model (Jukes and Cantor, 1969) assumes (among
other assumptions) that all four nucleotides are equally frequent and all
substitution types are equally likely. Other, more complex (and usually more
realistic), models of molecular change have been proposed (see Swofford and
Olsen, 1990, and references therein for a review). Thus, in comparing
alternative phylogenetic trees of a given set of DNA sequences and given a
particular model of evolutionary change, one could calculate the relative likelihood
of each tree producing the data. Detailed descriptions of the maximum
likelihood technique are provided in Felsenstein (1981, 1988) and Swofford and
Olsen (1990).
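
A full likelihood calculation over a tree is beyond a short example, but the
simplest possible case, two sequences joined by a single branch, shows the idea.
The sketch below assumes the Jukes-Cantor model with the branch length d
measured in expected substitutions per site; the function names and the two
ten-site sequences are hypothetical.

```python
import math


def jc69_log_likelihood(seq1: str, seq2: str, d: float) -> float:
    """Log-likelihood of two aligned sequences separated by branch length d > 0."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    log_l = 0.0
    for a, b in zip(seq1, seq2):
        # 0.25 is the assumed equal frequency of the base at the first sequence.
        log_l += math.log(0.25) + math.log(p_same if a == b else p_diff)
    return log_l


def jc69_distance(seq1: str, seq2: str) -> float:
    """Branch length that maximizes the likelihood above (needs < 75% differences)."""
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


s1, s2 = "ACGTACGTAC", "ACGTACGTTT"      # 2 of 10 sites differ
best = jc69_distance(s1, s2)
for d in (0.05, best, 1.0):
    print(round(d, 3), round(jc69_log_likelihood(s1, s2, d), 3))
```

Comparing the printed log-likelihoods shows that branch lengths much shorter or
much longer than the estimate explain the observed differences less well, which
is the same comparison one makes among alternative trees.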

With distance methods one calculates some sort of genetic distance between 
pairs of sequences and constructs a pairwise matrix of these distances between 
all such pairs. For example, for DNA sequence data one could calculate the 
fraction of sites that differ between two sequences (Felsenstein, 1988). A tree is
constructed by first joining the least distant sequences as sister sequences;
other sequences are then added onto the growing tree in a step-wise fashion until
all sequences are on the tree. Numerous algorithms have been devised to
construct phylogenetic trees based on distance methods (see Swofford and Olsen,
1990).
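
The step-wise joining just described can be sketched with one simple clustering
scheme, average linkage (UPGMA), applied to the fraction of differing sites.
UPGMA is chosen here only for illustration and is one of many distance
algorithms; the function names and the four ten-site sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Join the least distant clusters repeatedly; return the tree as nested tuples."""
    # Each cluster label (a name or a nested tuple) maps to its member taxa.
    clusters = {name: [name] for name in seqs}
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}

    def cluster_distance(c1, c2):
        # Average of the original pairwise distances between the members.
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merged = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = merged
    return next(iter(clusters))


seqs = {
    "A": "AAAAAAAAAA",
    "B": "AAAAAAAAAT",
    "C": "AAAAAATTTT",
    "D": "AAAACCTTTT",
}
print(upgma(seqs))  # (('A', 'B'), ('C', 'D'))
```

Here A and B (distance 0.1) are joined first, then C and D (distance 0.2), and
finally the two pairs are joined to each other.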

Huelsenbeck and Hillis (1993) conducted a large computer simulation study
to compare the effectiveness of a variety of phylogenetic analytical techniques
for finding known phylogenies from artificial molecular data sets of varying
size and under several different models of molecular evolution. Their results
did not provide simple answers in that one or a subset of techniques did not
always perform best. The degree of accuracy of any technique depended upon
the particular model of evolution employed and the relative divergence of the
hypothetical DNA sequences.

Table 1. Number of distinct unrooted trees for three or more taxa.

Number of taxa    Number of unrooted trees
      3           1
      4           3
      5           15
      6           105
      7           945
      8           10,395
      9           135,135
     10           2,027,025
     11           34,459,425
     12           654,729,075
     20           >221 billion billion
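
The counts in Table 1 can be reproduced with a short calculation. The sketch
below assumes strictly bifurcating trees and uses the standard result that n
taxa can be joined in 1 x 3 x 5 x ... x (2n - 5) unrooted trees, with rooting
multiplying that count by 2n - 3; the function names are hypothetical.

```python
def unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees: 1 * 3 * 5 * ... * (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed")
    count = 1
    for factor in range(3, 2 * n_taxa - 4, 2):   # odd factors up to 2n - 5
        count *= factor
    return count


def rooted_trees(n_taxa: int) -> int:
    """Each unrooted tree can be rooted on any of its 2n - 3 branches."""
    return unrooted_trees(n_taxa) * (2 * n_taxa - 3)


for n in (3, 7, 12):
    print(n, unrooted_trees(n), rooted_trees(n))
# 3 1 3
# 7 945 10395
# 12 654729075 13749310575
```

The rapid growth of these counts is the reason exhaustive evaluation of every
tree becomes impractical beyond a modest number of taxa.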

Fig. 5. Plot of the frequency distribution of 1000 randomly-generated trees from the rbcL
sequence data set of the Polypodium vulgare group from Haufler and Ranker (1995).

To appreciate the potentially daunting nature of the task of phylogenetic
reconstruction, consider the number of possible trees that can be generated
from a given number of taxa (Table 1; see Felsenstein, 1978). With only three
taxa, there is only one possible way to join them in an unrooted tree (i.e., a
tree in which no direction of evolution is implied). As the number of taxa
increases, however, the number of possible unrooted trees increases
dramatically. For example, with only 12 taxa, the number of possible unrooted trees
is over half a billion and with 20 taxa that number is over 221 billion billion.
When we attempt to root the tree (that is, to discern evolutionary
directionality), the number of possible trees is increased by a factor of twice the number
of taxa minus 3. For example, with only 3 taxa the number of possible trees
increases from 1 to 3 and with 12 taxa the number of possible trees increases
to 654,729,075 × 21. Given the possibility of so many trees, even if we find
one shortest, most parsimonious tree for a given set of taxa, we must ask: Is
this tree significantly different from all other possible trees? Hillis and
Huelsenbeck (Hillis, 1991; Hillis and Huelsenbeck, 1992; Huelsenbeck, 1991) have
suggested that one can use the degree of skewness of the distribution of lengths
of randomly-generated trees from a given data set as a measure of the
phylogenetic information content of that data set. The logic here is that if the
distribution of random trees is strongly skewed to the left, as shown in the
example in Fig. 5, then there will be very few random trees close in length to
the shortest, most parsimonious tree(s). By contrast, if the distribution of
random trees is normally distributed or skewed to the right, then the number of
random trees close in length to the most parsimonious tree(s) will be greater,
making the shortest tree(s) less distinct and, perhaps, less informative than
those that were randomly generated. Hillis and Huelsenbeck (1992) have
shown with computer-generated data and with experimentally-derived
phylogenies that the probability of finding the true phylogeny of a group increases
as the leftward skewness of the distribution of random trees increases. The
process of plotting the lengths of randomly-generated trees from a data set
(molecular or otherwise) and testing for significance of skewness is becoming
a commonly used technique to assess the phylogenetic informativeness of
one’s data. The statistic employed to test for the significance of leftward
skewness is the g1 statistic, and Hillis and Huelsenbeck (1992) provided tables of
critical values of g1 for binary and four-state character data. It should be noted
that Kallersjo et al. (1992) presented arguments suggesting that significant
skewness of the distribution of randomly generated trees may only indicate
non-random structure of a data set rather than phylogenetically-informative
structure.
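
The g1 statistic itself is the ordinary moment coefficient of skewness, so it is
easy to compute once a sample of random-tree lengths is in hand. In the sketch
below the tree lengths are invented for illustration and the function name is
hypothetical; judging significance still requires the published critical
values, which are not reproduced here.

```python
def g1_skewness(lengths):
    """Moment coefficient of skewness: m3 / m2**1.5 (negative = left-skewed)."""
    n = len(lengths)
    mean = sum(lengths) / n
    m2 = sum((x - mean) ** 2 for x in lengths) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in lengths) / n   # third central moment
    return m3 / m2 ** 1.5


# Hypothetical random-tree lengths: most trees are long, a few are much shorter.
left_skewed = [90, 97, 98, 99, 99, 100, 100, 100]
symmetric = [98, 99, 100, 101, 102]

print(g1_skewness(left_skewed) < 0)   # True: long tail toward short trees
print(g1_skewness(symmetric))         # 0.0
```

A strongly negative value corresponds to the situation in Fig. 5, where few
random trees approach the length of the shortest trees.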

A commonly used measure to assess the strength of a phylogenetic
hypothesis is the decay index (Bremer, 1988; Donoghue et al., 1992). This was
described by Donoghue et al. (1992) as “. . . the number of steps that must be
added before each clade present in the minimum length tree is no longer
unequivocally supported ...” Decay indices are calculated by sequentially saving
all trees one step longer than the shortest tree(s), two steps longer, three steps
longer, etc. At each step, strict consensus trees are constructed (i.e., trees in
which the only branches shown as dichotomously resolved are those that appear
so in all of the saved trees). The procedure is continued until no branches
appear as dichotomously resolved (i.e., the entire tree is an unresolved
“bush”). Thus, the higher the decay index, the more support there is for a
given dichotomously-resolved branch.
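
The bookkeeping behind the decay index can be sketched as follows. The example
assumes that all trees up to the required length have already been found and
that each tree is summarized by its length and the set of clades it contains;
the three trees, their lengths, and the function names are hypothetical.

```python
def clade(taxa: str) -> frozenset:
    """Represent a clade as the set of its member taxa (one letter per taxon)."""
    return frozenset(taxa)


def decay_indices(trees):
    """Map each clade of the shortest-tree strict consensus to its decay index.

    `trees` is a list of (length, set of clades). A clade collapses in the
    strict consensus as soon as a saved tree lacks it, so its decay index is
    the length of the shortest tree lacking it minus the minimum length.
    """
    shortest = min(length for length, _ in trees)
    supported = frozenset.intersection(
        *(frozenset(clades) for length, clades in trees if length == shortest)
    )
    result = {}
    for c in supported:
        lacking = [length for length, clades in trees if c not in clades]
        result[c] = min(lacking) - shortest if lacking else None  # None: never lost
    return result


trees = [
    (10, {clade("AB"), clade("DE")}),   # the single shortest tree
    (11, {clade("AB"), clade("CD")}),   # one step longer, lacks clade DE
    (13, {clade("AC"), clade("DE")}),   # three steps longer, lacks clade AB
]
support = decay_indices(trees)
print(support[clade("AB")], support[clade("DE")])  # 3 1
```

In this toy case clade AB survives until trees three steps longer are admitted,
whereas clade DE collapses after a single extra step.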

Lastly, I will discuss the potential problem of the existence of what are called
“multiple islands of equally parsimonious trees” (Maddison, 1991). As
discussed above, once we start dealing with about 10 taxa, the number of possible
trees increases rapidly as even more taxa are added to an analysis (Table 1).
With the combination of a large number of taxa and a large number of
characters (as we typically have with DNA sequence data), one quickly surpasses
the ability of most readily available desktop computers to conduct an
exhaustive search in a timely fashion, wherein all possible tree topologies are
examined in an effort to find the most parsimonious trees. With large data sets
(i.e., greater than 20 taxa), we are forced to resort to non-exhaustive searches,
or so-called heuristic methods, that involve branch swapping (see Swofford
and Olsen, 1990). Maddison (1991) found that in the process of conducting a
less than exhaustive search, there could exist what he called “multiple islands
of most parsimonious trees.” Each island consists of one or more most
parsimonious trees. When multiple trees exist on a single island they differ from
each other by only one branch rearrangement or, if they differ by more than
one, are connected by a series of intermediate trees that only differ by one
branch rearrangement. By contrast, members of different islands always differ
by more than one branch rearrangement (see Fig. 3 of Maddison, 1991). The
analytical problem is that if a search is begun with only one initial tree
arrangement and the computer algorithm only saves new trees of equal or shorter
length (found by branch swapping), then the search may only find the trees
on one island but not on the others. The problem can be solved by conducting
multiple searches wherein different tree topologies are used as initial trees in
each repeated search. Olmstead and Palmer (1994) provide an outline of
searching strategies for finding multiple islands with large data sets.
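
Why a single starting tree can miss an island, and why repeated searches help,
can be shown with a deliberately simplified model. In the sketch below "tree
space" is reduced to a row of ten numbered trees in which neighbours differ by
one rearrangement; the tree lengths, the function names, and this
one-dimensional layout are all hypothetical stand-ins for real branch swapping.

```python
def heuristic_search(lengths, start):
    """Swap outward from `start`, saving only trees of equal or shorter length."""
    best = lengths[start]
    frontier, seen = [start], {start}
    while frontier:
        node = frontier.pop()
        for neighbour in (node - 1, node + 1):   # one rearrangement away
            if (0 <= neighbour < len(lengths) and neighbour not in seen
                    and lengths[neighbour] <= best):
                seen.add(neighbour)
                frontier.append(neighbour)
                best = min(best, lengths[neighbour])
    return {n for n in seen if lengths[n] == best}


def multi_start_search(lengths, starts):
    """Repeat the search from several initial trees and pool the shortest trees."""
    found = [heuristic_search(lengths, s) for s in starts]
    best = min(lengths[n] for trees in found for n in trees)
    return set().union(*(t for t in found if lengths[next(iter(t))] == best))


# Two islands of length-4 trees (trees 1-2 and 6-7) separated by longer trees.
lengths = [5, 4, 4, 6, 7, 6, 4, 4, 5, 6]
print(heuristic_search(lengths, 0))          # {1, 2}: only the first island
print(multi_start_search(lengths, [0, 9]))   # {1, 2, 6, 7}: both islands
```

The search begun at tree 0 never crosses the longer trees that separate the
islands, which is exactly the trap that multiple initial trees are meant to avoid.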